%
% Section for evaluation
%
\section{Evaluation}
\label{sec:Evaluation}

% Subsection for Overview
\subsection{Overview}

The evaluation of ChronoSearch centered on two commonly measured Information Retrieval performance characteristics: precision and recall. In the context of ChronoSearch, precision is the percentage of retrieved results that are relevant to the query entity, and recall is the percentage of the events for that entity that exist on the Web which the system actually retrieves.
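Denoting by $R$ the set of events retrieved by the system for an entity and by $T$ the truth set of relevant events, these measures can be written in standard Information Retrieval notation as
\[
\mathrm{precision} = \frac{|R \cap T|}{|R|}, \qquad
\mathrm{recall} = \frac{|R \cap T|}{|T|}.
\]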

To measure the precision and recall of the system, tests were performed using three input entities, comparing the results from ChronoSearch against manually built timelines that already existed on the Web. The three input entities used for the evaluation were Bill Gates, Steve Jobs, and Jim Tressel. For each entity, a truth set was constructed by manually merging the timelines that existed on the Web for that entity, drawing on timelines and information from websites such as CNN, The Telegraph, NPR, and Wikipedia. To measure recall, the output from ChronoSearch was evaluated against the truth set, and ChronoSearch's recall percentage was then compared to the recall percentages of the manually generated timelines found on the Web. The precision of the system was measured by comparing the output of the system as currently implemented against the output of a baseline solution. The baseline solution consisted of the current implementation of ChronoSearch minus the duplicate removal methods and the bad sentence removal method; in other words, the baseline did not remove bad sentences with irregular average word lengths, and it did not remove duplicates via the cosine similarity or verb similarity methods.

% Subsection for Recall Evaluation
\subsection{Recall Evaluation}

To measure the recall percentage attained by ChronoSearch, the timelines produced by ChronoSearch were compared to a manually generated truth set. To objectively select which events belonged in the truth set, it was constructed by manually merging existing timelines already present on the Web. The recall percentage was then calculated as the percentage of events in the truth set that were also found in the timeline being evaluated. Both the manually built timelines from the Web and the ChronoSearch results were evaluated using this methodology, and the results from each timeline were then compared. The results of this comparison are displayed in Figure~\ref{fig:RecallEvaluation}. There are three sets of recall measurements, one for each of the person entities searched for in the evaluation: Steve Jobs on the left, Bill Gates in the middle, and Jim Tressel on the right. The recall percentages for ChronoSearch are shown in blue. For Steve Jobs, ChronoSearch achieved a recall percentage of nearly 75\%. In comparison, the manual timeline from CNET achieved a recall percentage of only around 60\%, although the manual timeline from The Telegraph did outperform ChronoSearch in this case with a recall percentage of nearly 82\%. The data for Bill Gates is very similar: ChronoSearch had a recall percentage of 73\%, whereas the manual timeline on NPR demonstrated 76\% recall and a personal history report had 58\% recall. The results for Jim Tressel show the recall percentage of ChronoSearch compared against a truth set generated from Wikipedia alone, since no existing timelines were available for Jim Tressel. This data point shows that even when a timeline does not already exist for an entity, ChronoSearch still succeeds in constructing one.
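The recall computation described above can be sketched as follows. The event strings are hypothetical, and the matching criterion between a generated event and a truth-set entry is assumed here to be exact string match; the paper does not specify how matches were judged.

```python
def recall(timeline_events, truth_set):
    """Percentage of truth-set events also present in the evaluated timeline."""
    matched = sum(1 for event in truth_set if event in timeline_events)
    return 100.0 * matched / len(truth_set)

# Hypothetical truth set merged from existing Web timelines.
truth = {"1975: founds Microsoft", "1995: releases Windows 95",
         "2000: steps down as CEO"}
# Hypothetical generated timeline for the same entity.
generated = {"1975: founds Microsoft", "2000: steps down as CEO",
             "1994: marries Melinda"}

print(round(recall(generated, truth), 1))  # -> 66.7
```

The same function is applied to each manually built Web timeline, so all timelines are scored against the same merged truth set.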

Overall, the recall percentage attained by ChronoSearch outperformed the average recall percentage of manually generated timelines present on the Web. Figure~\ref{fig:RecallEvaluation} also shows that ChronoSearch was able to dynamically generate timelines of the same level of quality even in the absence of manually generated timelines on the Web.


%
% Chart for the Recall Evaluation
%
\begin{figure}
\centering
\includegraphics[width=80mm]{RecallEvaluation.png}
\caption{Recall Evaluation}
\label{fig:RecallEvaluation}
\end{figure}

% Subsection for Precision Evaluation
\subsection{Precision Evaluation}

As mentioned earlier, the precision of our system was measured by comparing the output of ChronoSearch as currently implemented against the output of the baseline solution. Our system improves the precision of the output by removing candidate events that are duplicates, as well as sentences that do not belong in the output. To do this, three removal methods were utilized. The first removed bad sentences whose average word length fell outside the range of [3.2, 7.2] characters per word. The second removed duplicate sentences whose cosine similarity was greater than 50\%. The third removed events that occurred on the same day and had a verb similarity greater than 50\%.
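The three removal methods can be sketched as below. This is illustrative only: the tokenization, the bag-of-words representation for cosine similarity, and the use of Jaccard overlap for verb similarity are assumptions of this sketch, not details confirmed by the text.

```python
import math
from collections import Counter

def average_word_length_ok(sentence, lo=3.2, hi=7.2):
    # Removal method 1: discard sentences whose average word length
    # falls outside [3.2, 7.2] characters per word.
    words = sentence.split()
    avg = sum(len(w) for w in words) / len(words)
    return lo <= avg <= hi

def cosine_similarity(a, b):
    # Removal method 2: bag-of-words cosine similarity between two
    # event descriptions (whitespace tokenization is an assumption).
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def verb_similarity(verbs_a, verbs_b):
    # Removal method 3: overlap between the verb sets of two same-day
    # events (Jaccard overlap is assumed here).
    union = set(verbs_a) | set(verbs_b)
    return len(set(verbs_a) & set(verbs_b)) / len(union) if union else 0.0
```

A candidate pair would then be flagged as a duplicate when `cosine_similarity` exceeds 0.5, or when the two events share a date and `verb_similarity` exceeds 0.5.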


The overall precision improvement is shown in Figure~\ref{fig:PrecisionImprovement}. This was calculated as the fraction of the baseline output that was correctly removed, excluding removals that were later manually classified as false positives (i.e., removed results that were present in the truth set). For example, as shown in Figure~\ref{fig:PrecisionImprovement}, a single run of the baseline for Steve Jobs produced a total of 187 output events, of which 48 were removed by our removal techniques. However, 10 of the 48 removals were false positives, yielding an improvement of 38/187, or roughly 20\%. The average precision improvement of the system was nearly 30\%.
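The improvement figure can be reproduced directly from the reported counts; a minimal sketch using the Steve Jobs numbers cited above:

```python
def precision_improvement(baseline_events, removed, false_positives):
    # Correct removals (removals minus false positives) as a
    # percentage of the baseline output size.
    return 100.0 * (removed - false_positives) / baseline_events

# Steve Jobs run as reported: 187 baseline events, 48 removals,
# 10 of which were false positives.
print(round(precision_improvement(187, 48, 10), 1))  # -> 20.3
```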

%
% Chart for the Precision Improvement
%
\begin{figure}
\centering
\includegraphics[width=80mm]{PrecisionEvaluation1.png}
\caption{Precision Improvement}
\label{fig:PrecisionImprovement}
\end{figure}

The number of output events removed by each of the three removal methods is shown in Figure~\ref{fig:EventRemovalStatistics}. For example, 48 candidate events were removed from the output for Bill Gates. Of those 48 events, 7 were removed for having irregular average word lengths, 38 were removed by the cosine similarity mechanism as duplicate events, and 3 were removed for having similar verbs in events that occurred on the same day.

%
% Chart for the Event Removal Statistics
%
\begin{figure}
\centering
\includegraphics[width=80mm]{PrecisionEvaluation2.png}
\caption{Event Removal Statistics}
\label{fig:EventRemovalStatistics}
\end{figure}

% Subsection for Correctness Statistics
\subsection{Correctness Measurements}

It is also useful to measure the accuracy of the removal methods utilized by ChronoSearch. To do this, false positives were identified for each removal technique: sentences that were removed by one of the three mechanisms described above but should not have been removed. The statistics regarding the false positives incurred by the removal methods are shown in Figure~\ref{fig:FalsePositives}. As an example, 7 event descriptions were removed in the Bill Gates run because they were detected as bad sentences due to average word length irregularities; of these 7, 4 were actual events that should not have been removed. In the same run, there were no false positives for the cosine similarity removal technique and 1 false positive for the verb similarity mechanism. Overall, ChronoSearch demonstrated a relatively low average false positive rate of under 15\%.
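The per-method counts from the Bill Gates run can be combined into an overall false positive rate for that run. The counts are taken from the text; pooling removals across methods before dividing is an assumption of this sketch.

```python
# (false positives, total removals) per removal method,
# Bill Gates run as reported in the text.
methods = {
    "word_length": (4, 7),
    "cosine_similarity": (0, 38),
    "verb_similarity": (1, 3),
}

total_fp = sum(fp for fp, _ in methods.values())
total_removed = sum(removed for _, removed in methods.values())
rate = 100.0 * total_fp / total_removed
print(round(rate, 1))  # -> 10.4
```

A rate of roughly 10\% for this run is consistent with the reported average of under 15\% across all runs.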

%
% Chart for the False Positives Removed
%
\begin{figure}
\centering
\includegraphics[width=80mm]{FalsePositivesRemoval.png}
\caption{Removal Methods False Positives}
\label{fig:FalsePositives}
\end{figure}

To measure the effectiveness of the duplicate detection mechanisms, we also counted the duplicate events that were not detected and removed, that is, events left in the output of the system that should have been removed as duplicates. Some misses are to be expected, since the removal methods utilized by ChronoSearch are not exhaustive. The number of duplicate events that went undetected in each person entity run is shown in Figure~\ref{fig:DuplicatesNotDetected}. As an example, in the Bill Gates run, 10 event descriptions in the output circumvented the removal mechanisms: 10 of the 51 total duplicates were not removed, meaning that nearly 20\% of the duplicate events pertaining to Bill Gates were missed. On average, however, the two duplicate detection mechanisms, cosine similarity and verb similarity, were able to find and remove nearly 85\% of the total duplicate events, so the average effectiveness of the duplicate detection methods was favorable.

%
% Chart for the Duplicate Events Not Detected
%
\begin{figure}
\centering
\includegraphics[width=80mm]{DuplicatesNotDetected.png}
\caption{Duplicate Events Not Detected}
\label{fig:DuplicatesNotDetected}
\end{figure}
